Zed

Univariate Plots Section

Tip: This is the Financial Contributions to Presidential Campaigns data set for Washington State (WA).

## 'data.frame':    292317 obs. of  9 variables:
##  $ cand_nm          : Factor w/ 24 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 19 19 4 4 22 19 4 19 19 ...
##  $ contbr_nm        : Factor w/ 49817 levels "'CALL, ALLAN",..: 11139 23477 26330 15549 13027 34466 26698 20215 24250 24250 ...
##  $ contbr_city      : Factor w/ 719 levels "","1644 WINDEMERE DR.",..: 495 560 560 301 560 630 321 576 243 243 ...
##  $ contbr_zip       : Factor w/ 53472 levels "","00159","00160",..: 38469 12154 22416 7522 16517 39786 36566 24929 35810 35810 ...
##  $ contbr_employer  : Factor w/ 13373 levels "","''I LIKE COMICS''",..: 1 7617 10388 1 4979 5404 7761 10063 7761 7761 ...
##  $ contbr_occupation: Factor w/ 8260 levels "","-"," CERTIFIED REGISTERED NURSE ANESTHETIS",..: 2111 4693 7485 6164 2160 3514 4693 5707 4693 4693 ...
##  $ contb_receipt_amt: num  25 27 50 55 18.9 ...
##  $ contb_receipt_dt : Factor w/ 671 levels "01-APR-15","01-APR-16",..: 505 78 121 414 350 212 121 414 78 121 ...
##  $ election_tp      : Factor w/ 5 levels "","G2016","O2016",..: 4 4 4 4 4 2 4 4 4 4 ...
##                       cand_nm                    contbr_nm     
##  Clinton, Hillary Rodham  :126190   BUNSON, JAMIE     :   282  
##  Sanders, Bernard         :121555   TREIBEL, RANDY    :   280  
##  Trump, Donald J.         : 16222   BUCKLEY, MARK     :   262  
##  Cruz, Rafael Edward 'Ted': 13357   WATSON, DONNA     :   256  
##  Carson, Benjamin S.      :  8085   BENACK, MARY ANN  :   214  
##  Rubio, Marco             :  2192   SOMERVILLE, DALENE:   191  
##  (Other)                  :  4716   (Other)           :290832  
##      contbr_city         contbr_zip          contbr_employer  
##  SEATTLE   : 82399   98053    :   286                : 42728  
##  VANCOUVER :  9555   981556413:   161   NONE         : 28069  
##  OLYMPIA   :  8965   98033    :   156   RETIRED      : 26858  
##  BELLINGHAM:  8105   98004    :   152   SELF-EMPLOYED: 15757  
##  TACOMA    :  8075   981882718:   145   NOT EMPLOYED : 15214  
##  BELLEVUE  :  7942   991633631:   133   SELF         :  8269  
##  (Other)   :167276   (Other)  :291284   (Other)      :155422  
##              contbr_occupation  contb_receipt_amt   contb_receipt_dt 
##  RETIRED              : 58036   Min.   :-8432.99   31-MAR-16:  3443  
##  NOT EMPLOYED         : 38766   1st Qu.:   15.00   29-FEB-16:  3323  
##  INFORMATION REQUESTED:  6361   Median :   27.00   31-MAY-16:  2576  
##  SOFTWARE ENGINEER    :  5494   Mean   :   82.95   30-MAR-16:  2549  
##  TEACHER              :  5035   3rd Qu.:   60.00   30-APR-16:  2540  
##  ATTORNEY             :  4753   Max.   :10800.00   09-MAR-16:  2312  
##  (Other)              :173872                      (Other)  :275574  
##  election_tp   
##       :   364  
##  G2016: 89381  
##  O2016:   109  
##  P2016:202462  
##  P2020:     1  
##                
## 

The raw 2016 Presidential Campaign Finance contributor data for WA has 292317 observations and 19 variables; each observation is one donation transaction. I dropped 10 columns with Python beforehand, so the data imported into R has 9 variables and 292317 observations.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -8432.99    15.00    27.00    82.95    60.00 10800.00

The summary of contb_receipt_amt shows some negative values. These observations need to be removed.

## [1] 289340      9

Now the data set has 289340 observations.
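A minimal sketch of this filtering step, using a small made-up data frame (`wa_toy` is a hypothetical stand-in, not the real data) and base R only:

```r
# Toy stand-in for the contributions data; values are invented for illustration.
wa_toy <- data.frame(
  cand_nm = c("A", "B", "A", "C"),
  contb_receipt_amt = c(25, -50, 18.9, 100)
)

# Drop the rows with a negative amount and keep everything else.
wa_clean <- subset(wa_toy, contb_receipt_amt >= 0)

# Same check as in the report: rows and columns after filtering.
dim(wa_clean)
```

Whether zero-amount rows should also go is an open choice; the sketch keeps them because the text only mentions negative values.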

Most donation amounts are small, and there are a few extremely high outliers. Consequently, a log10 transform is necessary to get a more readable histogram.

## 
##  Shapiro-Wilk normality test
## 
## data:  log_normal_sample
## W = 0.97459, p-value < 2.2e-16
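The transform and the normality check might be reproduced roughly as below. The amounts are simulated (log-normal), and the sample size of 5000 is an assumption based on the upper limit that `shapiro.test()` accepts; only the name `log_normal_sample` comes from the output above.

```r
set.seed(1)

# Simulated right-skewed amounts, a stand-in for contb_receipt_amt.
amt <- rlnorm(20000, meanlog = log(30), sdlog = 1)

# log10 compresses the long right tail.
log_amt <- log10(amt)

# shapiro.test() accepts between 3 and 5000 values, so test a random sample.
log_normal_sample <- sample(log_amt, 5000)
res <- shapiro.test(log_normal_sample)

res$statistic  # W close to 1 suggests a shape close to normal
res$p.value
```

With samples this large, even small departures from normality tend to give a tiny p-value, so W is usually the more informative number.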

## 80% 
## 100

We can see that 80% of the donations are $100 or less.

Next, I add more variables to explore potentially interesting patterns.

## 
## female   male 
## 148058 128403

Number of male and female contributors.

## 
## female   male 
##      3     21
## 
##   democrat     others republican 
##          5          3         16

Number of male and female candidates, and number of candidates by party.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0    14.0   403.5   133.0 81773.0
## [1] 717   2
##  95% 
## 1423

There are 717 different city values because of misspellings and abbreviated or overly detailed names. One solution is to cross-validate them against the zip code.
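The per-city counts summarized above could be computed along these lines. This is a sketch on invented values (including a deliberate misspelling), and the column names `contbr_city` and `n` in the result are illustrative choices:

```r
# Toy city column with one misspelled entry ("SEATLE").
city <- c("SEATTLE", "SEATTLE", "SEATTLE", "SEATLE",
          "TACOMA", "TACOMA", "OLYMPIA")

# One row per distinct spelling, with the number of contributions.
city_counts <- as.data.frame(table(city), stringsAsFactors = FALSE)
names(city_counts) <- c("contbr_city", "n")

summary(city_counts$n)         # distribution of contributions per city
dim(city_counts)               # number of distinct city values
quantile(city_counts$n, 0.95)  # 95% of cities have at most this many
```

The misspelled entry shows up as its own row with a very low count, which is why the number of distinct values is inflated.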

No surprise: the bar that stands out is Seattle.

## 
##  Shapiro-Wilk normality test
## 
## data:  log_normal_sample
## W = 0.91347, p-value < 2.2e-16

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     1.00     4.00    35.09    11.00 57491.00
## [1] 8245    2
## 95% 
##  59

There are 8260 different occupations! 95% of them have 59 or fewer contributions.

## 
##  Shapiro-Wilk normality test
## 
## data:  log_normal_sample
## W = 0.90732, p-value < 2.2e-16

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    16.0    93.0   447.9   538.2  7654.0
## [1] 646   2

There are 646 different zip codes.

## 
##  Shapiro-Wilk normality test
## 
## data:  log_normal_sample
## W = 0.97488, p-value = 4.424e-09

Moreover, I add demographic variables to the data set.

## 
##  Shapiro-Wilk normality test
## 
## data:  normal_sample
## W = 0.98395, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  normal_sample
## W = 0.94319, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  normal_sample
## W = 0.98174, p-value < 2.2e-16

## 
##  Shapiro-Wilk normality test
## 
## data:  normal_sample
## W = 0.94666, p-value < 2.2e-16

Univariate Analysis

What is the structure of your dataset?

There are 289405 observations of 22 variables.
There are 9 original variables:
* cand_nm
* contbr_nm
* contbr_zip
* contbr_city
* contbr_employer
* contbr_occupation
* contb_receipt_amt
* contb_receipt_dt
* election_tp

There are 5 derived variables:
* party
* cand_first_nm
* cand_gender
* contbr_first_nm
* contbr_gender

There are 8 demographic variables coming from df_zip_demographics data set:
* total_population
* percent_white
* percent_black
* percent_asian
* percent_hispanic
* per_capita_income
* median_rent
* median_age

There are many categorical variables, but none is an ordered factor.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the data set are contb_receipt_amt, party, cand_nm, contbr_gender, total_population, per_capita_income, percent_white and median_age. I hope these variables can be used to build a model that predicts the contribution amount and the party it goes to.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think contbr_zip, contbr_occupation, contb_receipt_dt, cand_gender, median_rent, percent_black, percent_asian, percent_hispanic and election_tp are all relevant to the contribution amount.

Did you create any new variables from existing variables in the dataset?

There are 5 derived variables:
* party
* cand_first_nm
* cand_gender
* contbr_first_nm
* contbr_gender

cand_nm -> party
cand_nm -> cand_first_nm -> cand_gender
contbr_nm -> contbr_first_nm -> contbr_gender
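The derivation chain above (name, then first name, then gender or party) might be sketched as follows. The names are made up, and both lookup tables are hypothetical: the report does not say how gender was actually inferred from first names.

```r
# Invented names in the "LAST, FIRST MIDDLE" layout used by contbr_nm.
contbr_nm <- c("DOE, JANE", "ROE, JOHN Q.", "SMITH, MARY ANN")

# Text after the comma, trimmed, then the first whitespace-separated token.
after_comma <- trimws(sub("^[^,]*,", "", contbr_nm))
contbr_first_nm <- sub("\\s.*$", "", after_comma)

# Hypothetical first-name lookup (assumption, not the author's method).
gender_lookup <- c(JANE = "female", JOHN = "male", MARY = "female")
contbr_gender <- unname(gender_lookup[contbr_first_nm])

# Party can be mapped from the candidate name with a similar lookup.
party_lookup <- c("Clinton, Hillary Rodham" = "democrat",
                  "Sanders, Bernard" = "democrat",
                  "Trump, Donald J." = "republican")
cand_nm <- c("Sanders, Bernard", "Trump, Donald J.")
party <- unname(party_lookup[cand_nm])
```

First names missing from the lookup come back as NA, which may be one reason the gender counts are smaller than the number of observations.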

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Some contributions have amounts below 0 or above 2700, which is illegal. I simply removed these rows.

The three largest groups of contributors by occupation are RETIRED, NOT EMPLOYED and INFORMATION REQUESTED.

The election type has one blank level; I have no idea what it means.

There are 717 different city values because of misspellings and abbreviated or overly detailed names. One solution is to cross-validate them against the zip code.

There are 8260 different occupations! 95% of them have 59 or fewer contributions.

Bivariate Plots Section

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                   contb_receipt_amt election_tp      party contbr_gender
## contb_receipt_amt                 1  Polyserial Polyserial    Polyserial
## election_tp                 -0.1045           1 Polychoric    Polychoric
## party                        0.1345      0.1811          1    Polychoric
## contbr_gender               0.05219      0.1719     0.2495             1
## cand_gender                -0.08513      0.7649     0.7235        0.3392
## total_population           -0.01675    -0.02927   -0.08306     -0.004154
## percent_white             -0.001675     0.04469    0.08542      -0.03774
## percent_black              -0.00345    -0.03836    -0.1544       0.03451
## percent_asian               0.04366    -0.09241    -0.1807       0.02302
## percent_hispanic           -0.02905     0.02757      0.146       0.02988
## per_capita_income            0.1326     -0.1275    -0.2163      0.006958
## median_rent                 0.09303     -0.1078    -0.1644     -0.001523
## median_age               -0.0006057     0.02908    0.05621      -0.04872
## time                       -0.01964     -0.9579    -0.2709       -0.1579
##                   cand_gender total_population percent_white percent_black
## contb_receipt_amt  Polyserial          Pearson       Pearson       Pearson
## election_tp        Polychoric       Polyserial    Polyserial    Polyserial
## party              Polychoric       Polyserial    Polyserial    Polyserial
## contbr_gender      Polychoric       Polyserial    Polyserial    Polyserial
## cand_gender                 1       Polyserial    Polyserial    Polyserial
## total_population     -0.04162                1       Pearson       Pearson
## percent_white         0.06101          -0.3161             1       Pearson
## percent_black        -0.08147           0.1389       -0.7777             1
## percent_asian         -0.1408             0.31       -0.7548          0.56
## percent_hispanic      0.08144           0.1994       -0.5314        0.1567
## per_capita_income     -0.2141         -0.05897        0.1628       -0.1208
## median_rent           -0.1678           0.1782      -0.03041      -0.07032
## median_age            0.03728           -0.523        0.4226       -0.2529
## time                  -0.7313          0.02878      -0.03094       0.03218
##                   percent_asian percent_hispanic per_capita_income
## contb_receipt_amt       Pearson          Pearson           Pearson
## election_tp          Polyserial       Polyserial        Polyserial
## party                Polyserial       Polyserial        Polyserial
## contbr_gender        Polyserial       Polyserial        Polyserial
## cand_gender          Polyserial       Polyserial        Polyserial
## total_population        Pearson          Pearson           Pearson
## percent_white           Pearson          Pearson           Pearson
## percent_black           Pearson          Pearson           Pearson
## percent_asian                 1          Pearson           Pearson
## percent_hispanic       -0.00376                1           Pearson
## per_capita_income        0.2319           -0.409                 1
## median_rent              0.4367          -0.3276            0.7616
## median_age              -0.2799          -0.3175            0.1243
## time                    0.06582         -0.02828           0.08421
##                   median_rent median_age       time
## contb_receipt_amt     Pearson    Pearson    Pearson
## election_tp        Polyserial Polyserial Polyserial
## party              Polyserial Polyserial Polyserial
## contbr_gender      Polyserial Polyserial Polyserial
## cand_gender        Polyserial Polyserial Polyserial
## total_population      Pearson    Pearson    Pearson
## percent_white         Pearson    Pearson    Pearson
## percent_black         Pearson    Pearson    Pearson
## percent_asian         Pearson    Pearson    Pearson
## percent_hispanic      Pearson    Pearson    Pearson
## per_capita_income     Pearson    Pearson    Pearson
## median_rent                 1    Pearson    Pearson
## median_age           -0.08682          1    Pearson
## time                  0.06859   -0.02677          1

contb_receipt_amt and party have 0.135 correlation
contb_receipt_amt and election_tp have -0.105 correlation
contb_receipt_amt and per_capita_income have 0.133 correlation
cand_gender and contbr_gender have 0.339 correlation
cand_gender and per_capita_income have -0.214 correlation
party and per_capita_income have -0.216 correlation
party and percent_black have -0.154 correlation
party and percent_asian have -0.181 correlation
party and percent_hispanic have 0.146 correlation
total_population and median_age have -0.523 correlation
total_population and percent_white have -0.316 correlation
total_population and percent_black have 0.139 correlation
total_population and percent_asian have 0.31 correlation
total_population and percent_hispanic have 0.199 correlation
percent_white and percent_black have -0.777 correlation
percent_white and percent_asian have -0.755 correlation
percent_white and percent_hispanic have -0.531 correlation
percent_asian and per_capita_income have 0.232 correlation
percent_asian and median_rent have 0.437 correlation
percent_asian and median_age have -0.28 correlation
percent_hispanic and median_rent have -0.328 correlation
percent_hispanic and median_age have -0.318 correlation
percent_hispanic and per_capita_income have -0.409 correlation
per_capita_income and median_rent have 0.762 correlation
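For the purely numeric pairs (the ones labelled Pearson in the table), a correlation matrix can be obtained with base R as sketched below. The data are simulated, so the numbers will not match the table, and the polyserial and polychoric entries for the categorical variables need a dedicated routine that this sketch does not cover.

```r
set.seed(42)
n <- 500

# Simulated zip-level demographics; rent is built to depend on income.
per_capita_income <- rnorm(n, mean = 40000, sd = 10000)
median_rent <- 300 + 0.02 * per_capita_income + rnorm(n, sd = 150)
median_age <- rnorm(n, mean = 38, sd = 5)

demo <- data.frame(per_capita_income, median_rent, median_age)

# Pearson correlations between every pair of numeric columns.
cor_mat <- round(cor(demo, method = "pearson"), 3)
cor_mat
```

Reading one triangle of the matrix is enough, since it is symmetric with ones on the diagonal.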

Some interesting demographic relationships.

It shows that people with an income of around 70000 contribute the most in total. The peak in median contribution at low income seems very strange.

It shows that people of about 40 contribute the most in total, but people of about 30 make the highest median contribution. There are also two other peaks, at 25 and 60, in the median amount plot.

In addition, I group contb_receipt_amt by population structure.

This plot is ridiculous; I cannot make sense of it.

Contributions grow steadily as election day and other key dates approach. I wonder what the valley is.
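To look at contributions over time, the receipt dates (stored as text such as "31-MAR-16") first have to be parsed. The sketch below does this without relying on the locale by matching the month abbreviation against the built-in `month.abb`; the amounts are invented and the monthly grouping is just one possible choice.

```r
contb_receipt_dt <- c("01-APR-15", "31-MAR-16", "29-FEB-16", "09-MAR-16")
contb_receipt_amt <- c(25, 50, 27, 100)

# Split "DD-MON-YY" into its three parts.
parts <- strsplit(contb_receipt_dt, "-", fixed = TRUE)
day   <- as.integer(sapply(parts, `[`, 1))
month <- match(sapply(parts, `[`, 2), toupper(month.abb))
year  <- 2000L + as.integer(sapply(parts, `[`, 3))  # assumes years 20xx

# Rebuild an ISO date string and convert it to Date.
receipt_date <- as.Date(sprintf("%04d-%02d-%02d", year, month, day))

# Total contribution per calendar month.
month_key <- format(receipt_date, "%Y-%m")
monthly_total <- tapply(contb_receipt_amt, month_key, sum)
monthly_total
```

Plotting `monthly_total` (or a daily equivalent) against time is one way to locate the valley mentioned above.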

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

contb_receipt_amt and party have 0.135 correlation. contb_receipt_amt and election_tp have -0.105 correlation.
contb_receipt_amt and per_capita_income have 0.133 correlation.

Democrats’ total contribution is larger than Republicans’, but Republicans’ mean contribution is larger than Democrats’. People with an income of around 70000 contribute the most in total. Men contribute more than women in both sum and mean, but not by much. There are some high-contribution outliers at low income, which seems very weird.
People of about 40 contribute the most in total, but people of about 30 make the highest median contribution. There is one weird peak at about 22 in the sum of contributions.
There are also two other peaks, at 25 and 60, in the median contribution. Contributions grow steadily as election day and other key dates approach.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

cand_gender and contbr_gender have 0.339 correlation
cand_gender and per_capita_income have -0.214 correlation
party and per_capita_income have -0.216 correlation
party and percent_black have -0.154 correlation
party and percent_asian have -0.181 correlation
party and percent_hispanic have 0.146 correlation
total_population and median_age have -0.523 correlation
total_population and percent_white have -0.316 correlation
total_population and percent_black have 0.139 correlation
total_population and percent_asian have 0.31 correlation
total_population and percent_hispanic have 0.199 correlation
percent_white and percent_black have -0.777 correlation
percent_white and percent_asian have -0.755 correlation
percent_white and percent_hispanic have -0.531 correlation
percent_asian and per_capita_income have 0.232 correlation
percent_asian and median_rent have 0.437 correlation
percent_asian and median_age have -0.28 correlation
percent_hispanic and median_rent have -0.328 correlation
percent_hispanic and median_age have -0.318 correlation
percent_hispanic and per_capita_income have -0.409 correlation
per_capita_income and median_rent have 0.762 correlation

What was the strongest relationship you found?

percent_white and percent_black have a correlation of -0.777, even stronger than the 0.762 correlation between per_capita_income and median_rent.

Multivariate Plots Section

Both Trump and Hillary are more popular with men than with women.

Moreover, I tried to build a linear model to predict contributions, but it does not fit well.

## 
## Calls:
## m1: lm(formula = contb_receipt_amt ~ per_capita_income, data = wa)
## m2: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender, 
##     data = wa)
## m3: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender + 
##     party, data = wa)
## m4: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender + 
##     party + median_age, data = wa)
## m5: lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender + 
##     party + median_age + percent_white, data = wa)
## 
## =====================================================================================================
##                           m1              m2              m3              m4              m5         
## -----------------------------------------------------------------------------------------------------
##   (Intercept)         -9.646***       -18.461***      -36.857***       -5.524          10.220**      
##                       (1.403)          (1.489)         (1.517)         (3.112)         (3.349)       
##   per_capita_income    0.003***         0.003***        0.003***        0.003***        0.003***     
##                       (0.000)          (0.000)         (0.000)         (0.000)         (0.000)       
##   contbr_gendermale                    20.036***       12.929***       12.442***       12.146***     
##                                        (0.929)         (0.931)         (0.932)         (0.932)       
##   partyothers                                         174.332***      175.027***      175.156***     
##                                                        (6.814)         (6.813)         (6.811)       
##   partyrepublican                                      71.136***       72.050***       73.165***     
##                                                        (1.329)         (1.331)         (1.334)       
##   median_age                                                           -0.848***       -0.422***     
##                                                                        (0.074)         (0.081)       
##   percent_white                                                                        -0.464***     
##                                                                                        (0.037)       
## -----------------------------------------------------------------------------------------------------
##   R-squared                   0.018           0.019           0.031           0.032           0.033  
##   adj. R-squared              0.018           0.019           0.031           0.032           0.033  
##   sigma                     242.979         241.651         240.144         240.086         240.015  
##   F                        5113.117        2676.403        2213.295        1798.083        1526.156  
##   p                           0.000           0.000           0.000           0.000           0.000  
##   Log-likelihood       -1970336.620    -1881166.331    -1879460.552    -1879394.094    -1879313.501  
##   Deviance          16829704435.971 15905510725.251 15707535943.265 15699872892.314 15690584822.102  
##   AIC                   3940679.240     3762340.662     3758933.104     3758802.189     3758643.001  
##   BIC                   3940710.922     3762382.722     3758996.193     3758875.793     3758727.121  
##   N                      285064          272379          272379          272379          272379      
## =====================================================================================================
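The nested-model comparison above can be illustrated on simulated data as follows. The effect sizes and noise level are assumptions picked to resemble the order of magnitude in the table, and only three of the five models are shown.

```r
set.seed(7)
n <- 2000

# Simulated predictors (not the real data).
toy <- data.frame(
  per_capita_income = rnorm(n, 40000, 10000),
  contbr_gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  party = factor(sample(c("democrat", "others", "republican"), n, replace = TRUE))
)

# Weak signal buried in a lot of noise.
toy$contb_receipt_amt <- 0.003 * toy$per_capita_income +
  12 * (toy$contbr_gender == "male") +
  70 * (toy$party == "republican") +
  rnorm(n, sd = 240)

# Add one predictor at a time, as in m1 to m3.
m1 <- lm(contb_receipt_amt ~ per_capita_income, data = toy)
m2 <- update(m1, . ~ . + contbr_gender)
m3 <- update(m2, . ~ . + party)

# R-squared can only stay equal or grow as predictors are added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
round(r2, 3)
```

Significant coefficients together with an R-squared of about 0.03, as in the table, mean the predictors have a detectable effect but explain only around 3% of the variance in contribution amount.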

That is very interesting! There is a strong and interesting relationship between contributor occupation and the candidate’s background.

Hillary’s contributions dipped into a valley after the email controversy and then recovered. After a peak around the first week of October, contributions slumped; I cannot figure out why. Trump’s contributions stay consistently low compared with Hillary’s.

Almost the same trend regardless of per capita income!

Furthermore, I subset the map data for WA.

If the Democratic total contribution in a zip code area is larger than the Republican total, the whole area is filled with blue, otherwise red. The map shows that Democrats and Republicans seem evenly matched.
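The colouring rule described above might be computed like this before joining onto the map data. The rows are invented, and truncating 9-digit zip codes to their first five digits is an assumption about how the zip codes were normalized.

```r
toy <- data.frame(
  contbr_zip = c("98101", "981011234", "98101", "99201", "992011111", "98501"),
  party = c("democrat", "republican", "democrat",
            "republican", "democrat", "democrat"),
  contb_receipt_amt = c(100, 50, 25, 300, 40, 10)
)

# Assumption: reduce ZIP+4 values to the 5-digit zip code.
toy$zip5 <- substr(toy$contbr_zip, 1, 5)

# Total contribution per zip code and party (missing combinations become 0).
totals <- xtabs(contb_receipt_amt ~ zip5 + party, data = toy)

# Blue where the Democratic total is strictly larger, otherwise red.
fill <- ifelse(totals[, "democrat"] > totals[, "republican"], "blue", "red")
fill
```

Under this rule a tie is coloured red, and zip codes without any contribution do not appear at all, which would leave blank areas on the map.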

The map shows that Republicans receive more contributions by zip code.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Hillary received many large contributions as the Democratic nomination date approached. Her contributions dipped into a valley after the email controversy and then recovered. After a peak around the first week of October, contributions slumped; I cannot figure out why. Trump’s contributions stay consistently low compared with Hillary’s. Although Hillary eventually won Washington State, it is hard to say that there is a strong relationship between contributions and votes.

Were there any interesting or surprising interactions between features?

Both Trump and Hillary are more popular with men than with women. Female candidates receive fewer contributions than male candidates.

OPTIONAL: Did you create any models with your dataset? Discuss the

## 
## Call:
## lm(formula = contb_receipt_amt ~ per_capita_income + contbr_gender + 
##     party + median_age + percent_white, data = wa)
## 
## Coefficients:
##       (Intercept)  per_capita_income  contbr_gendermale  
##         10.219845           0.002914          12.145989  
##       partyothers    partyrepublican         median_age  
##        175.155750          73.164645          -0.422120  
##     percent_white  
##         -0.464180

Final Plots and Summary

Plot One

Description One

After faceting by age, the plot shows that Hillary is more popular with women than with men.
Sanders is popular across all age groups, with both men and women.

It seems that Trump has a good contribution distribution.

Plot Two

Description Two

After working all night, I finally managed to add annotations to the different facets at different coordinates.

Hillary’s contributions dipped into a valley after the email controversy and then recovered. After a peak around the first week of October, contributions slumped; I cannot figure out why. Trump’s contributions stay consistently low compared with Hillary’s. Although Hillary eventually won Washington State, it is hard to say that there is a strong relationship between contributions and votes.

Plot Three

Description Three

Though Hillary has more contributions, Trump’s contributions seem to be spread more broadly. But there are too many blank districts, so it is hard to draw a conclusion.

Plot Four

This is the most exciting plot in the whole analysis. The plot shows a strong and interesting relationship between contributor occupation and the candidate’s background.

As for Hillary, the top occupation is attorney, and the 11th is lawyer. Yes, we all know that Hillary once belonged to this group. The second is homemaker, which suggests that Hillary is really popular with women. We can also see that Hillary is popular with the not employed and with the education sector.

As for Trump, the top occupation is self-employed. That is amazing. Perhaps Trump has some traits that the self-employed appreciate, such as courage. The top 10 also includes CEO, president, business owner and owner. What is more, Trump also has a group of donors with occupations such as contractor, project manager and real estate.
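A ranking like the one discussed here can be produced by counting occupations separately for each candidate. The rows below are invented, and the cut-off of two occupations per candidate is only for illustration (the plot uses a longer list).

```r
toy <- data.frame(
  cand_nm = c(rep("Clinton, Hillary Rodham", 5), rep("Trump, Donald J.", 4)),
  contbr_occupation = c("ATTORNEY", "ATTORNEY", "HOMEMAKER", "TEACHER", "ATTORNEY",
                        "SELF-EMPLOYED", "CEO", "SELF-EMPLOYED", "OWNER")
)

# For each candidate: count occupations, sort descending, keep the top k.
top_k <- 2
top_occupations <- lapply(
  split(toy$contbr_occupation, toy$cand_nm),
  function(occ) {
    counts <- sort(table(occ), decreasing = TRUE)
    counts[seq_len(min(top_k, length(counts)))]
  }
)
top_occupations
```

Near-duplicate labels such as attorney and lawyer are counted separately here, so merging synonyms first would change the ranking.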

Hillary is supported by nurses, while Trump is supported by farmers.

Reflection

The contribution map shows that many districts have no contributions at all. I don’t know whether this is a data quality problem or the truth. Maybe Washington State is not a good data set for analyzing the election campaign.

It is hard to find a strong relationship between the election result and contributions, since Trump paid for his campaign himself.

There is a really strong relationship between contributions and date.

The strongest relationship is between candidates and their donors’ occupations, which shows up in the bar plot but not in the correlations.

Since contributions are capped and influenced by many factors, I think building a model to predict them is nearly impossible.

The data quality is not good enough, and some data are even missing because of inconsistent spellings, manual errors and other unknown reasons.

It would be helpful to import vote, demographic and geographic data to cross-validate and supplement the election contribution data set.

I think a larger data set, such as the one for the whole USA, would reveal more interesting relationships.